from ipypublish.scripts.ipynb_latex_setup import *
The Cancer Genome Atlas (TCGA) has collected mutation and expression data for over 20,000 tumor samples, but most subtypes of cancer have few normal tissue samples to compare against. We uniformly computed expression data for both TCGA and The Genotype Tissue Expression Constortium (GTEx), which collected expression data from thousands of normal tissue samples, to create a large repository of cancer and normal expression data free of computational batch effects. Combined expression data was validated by identifying known cancer phenotypes for several antineoplastic drug targets and finding similar expression patterns in both TCGA and GTEx. Repositioning candidates were found by identifying cancer subtypes that share phenotypes with the positively validated targets.
# Set autoreload module for dev
%load_ext autoreload
%autoreload 2
%aimport rnaseq_lib
# Imports
import pandas as pd
import rnaseq_lib as r
import holoviews as hv
hv.extension('bokeh', logo=False)
# Inputs
## Synapse ID: syn11515015
df_path = '/mnt/rnaseq-cancer/Objects/tcga-gtex-metadata-expression.tsv'
df = pd.read_csv(df_path, sep='\t', index_col=0, dtype=r.tissues.dtype)
# Holoviews object wrapper for dataframe
h = r.plot.Holoview(df)
Previous large scale comparisons of gene expression in TCGA have been done with fewer than 5,000 samples — a majority (~90%) of the samples derived from primary tumor tissue and the rest representing 'normal' tissue, some of which is taken from the same patient carrying the tumor \cite{guo_large_2013, peng_large-scale_2015-1}. In one study, the dataset consisted of 4,043 tumor samples and 548 normal tissue samples across 21 TCGA cancer types — an average of only 26 normal samples for each cancer type compared to ~200 tumor samples \cite{peng_large-scale_2015-1}. While gene expression in normal tissue is more homogeneous than tumor samples, batch effects and contamination are a common problem with RNA-seq, which complicates an already noisy data source. Additionally, the small sample sizes cannot accurately reflect the general population, and therapeutics guided by results obtained from these analyses may have a higher likelihood of failing in clinical trials. Given the lack of normals in TCGA, the second largest dataset processed in the large-scale compute are non-cancerous samples collected from GTEx, which provides valuable insights into the mechanisms of gene regulation by studying human gene expression and regulation in tissues from healthy individuals \cite{consortium_genotype-tissue_2015}.
There are a myriad of antineoplastic drugs designed to fight different types of cancer, but most only target a small population within a single subtype of cancer, which gives most patient few options. By validating known expression biomarkers for antineoplastics, these same expression motifs can be used to identify candidate subtypes of cancer that may respond to treatment.
Image('figures/Datasets.png', width=400)
The above figure, which depicts three different RNA-seq datasets containing ~20,000 samples, totals more than 110 terabytes of patient data — more data than can fit on most machines and far too much data to process efficiently on most hardware available to academics. Two of the three are cancer datasets, including The Cancer Genome Atlas (TCGA) which includes over 11,000 patients across 33 tumor types and represents the largest tumor collection of tumor data \cite{the_cancer_genome_atlas_research_network_cancer_2013-1}. GTEx contains over 8,000 samples spanning almost every tissue in the human body, with the goal providing a homogenous set of expression data representing healthy tissue.
In order to process this massive combined dataset in a timely and efficient manner, our lab developed Toil, a distributed workflow platform capable of massive scale \cite{vivian_toil_2017}. I wrote a Toil-based RNA-seq workflow which provides results that are concordant to TCGA's RNA-seq workflow, but is an order of magnitude faster and provides quantification outputs from multiple tools \cite{vivianbd2kgenomics/toil-rnaseq}.
Image('figures/toil-rnaseq.png', width=600)
The workflow was run on all 20,000 samples with a throughput of 99.6% on an Amazon Web Services cluster that peaked at 32,000 cores and 60TB of memory and finished in just under 4 days with room for improvement.
Image('figures/shared_cores.png', width=600)
While GTEx contains thousands of normal tissue samples, they can't be compared directly to TCGA due to differences in sequencing depth and laboratory batch effects. Unfortunately, there don't exist standard RNA-seq benchmark samples that every consortium uses to calibrate with before processing, which would likely introduce fewer batch effects that are easier to correct. Current available methods typically attempt naive distribution fitting that tend to work less effectively as the amount of samples and classes increases \cite{johnson_adjusting_2007,shaham_removal_2017}. Instead, we can evaluate GTEx as a prior by normalizing for sequencing depth and dispersion, then comparing differential expression results for protein-coding genes between TCGA normals and GTEx normals to see how good a prior GTEx is.
h.sample_counts()
tissues = ['Breast', 'Colon', 'Kidney', 'Liver', 'Lung', 'Prostate', 'Stomach', 'Thyroid', 'Uterus']
de_gtex = h.differential_expression_tissue_concordance(tissue_subset=tissues, tcga=False).relabel('GTEx')
de_tcga = h.differential_expression_tissue_concordance(tissue_subset=tissues, gtex=False).relabel('TCGA')
%%opts HeatMap [width=450 height=425]
(de_gtex + de_tcga).relabel('Differential Expression Gene Concordance (PearsonR)')
For almost every tissue with a sufficient number of samples, the GTEx tissue counterpart is the closest approximate to the TCGA normal. Many tissues, like Bladder, Esophagus, Pancreas, and Skin have so few TCGA normals that the GTEx counterpart not being highly concordant isn't suprising. This is corroborated by GTEx being the closest approximate for tissues with larger TCGA normal sample sizes like Breast and Kidney.
To start identifying repositioning candidates, comprehensive information on current FDA-approved antineoplastics was collated and biomarkers were identified for each drug. Drugs whose primary biomarker was expression-based were selected for further investigation.
Image('figures/Expression_Discovery_Methods.png', width=600)
Image('figures/drugs.png', width=800)
Tumor hypoxia is associated clinically with therapeutic resistance and poor patient outcomes. One feature of tumor hypoxia is activated expression of carbonic anhydrase IX (CA9), a regulator of pH and tumor growth. Disruption of the downstream bicarbonate products can acidify tumor cells and suppress tumor growth \cite{mcintyre_disrupting_2016}. Hypoxia also promotes tumour heterogeneity through the epigenetic regulation of CA9 \cite{ledaki_carbonic_2015}. CA9 is also a transmembrane protein and is stained for for use as an endogenous marker for investigating hypoxia \cite{newbold_exploratory_2009}.
CA9 is part of a family of carbonic anhydrases (zinc metalloenzymes) that catalyze reversible hydration of carbion dioxide to form carbonic acid. Girentuximab (trade name Rencarex) is a chimeric IgG1 monoclonal antibody to carbonic anhydrase IX which was granted fast track status and orphan drug designation by the FDA for renal cancer \cite{girentuximab}. In January 2017, Telix Pharmaceuticals Limited, an Australian biotechnology company, announced that it had in-licensed Girentuximab for use as a radioimmunoconjugate, iodine (124I) girentuximab, called Redectane \cite{wilexwilex}.
CA9 is reported as a ubiquitous marker in renal cell carcinoma (RCC) \cite{chen_expression_2005,turner_hypoxia-inducible_2002,kim_using_2005,wykoff_hypoxia-inducible_2000}, and should be simple to validate by examing the expression distributions of CA9.
h.gene_kde(gene='CA9', tissue_subset=['Kidney'])
CA9 is significantly upregulated with GTEx and TCGA normal samples for kidney sharing similar distributions of low expression. A comparison of Bland-Altman plots \cite{altman_measurement_1983} for kidney also show almost identical upregulation of CA9 as well as neighboring upregulated genes.
sequence = ['blue', 'red']
%%opts Scatter [color_index='label' size_index='size' width=450] (cmap=sequence)
label = {'CA9': ['CA9']}
extents = (0, -12, 20, 12)
ca9_gtex = h.tissue_de('Kidney', gene_labels=label, extents=extents).relabel('GTEx')
ca9_tcga = h.tissue_de('Kidney', gene_labels=label, tcga_normal=True, extents=extents).relabel('TCGA')
hv.Layout([ca9_gtex, ca9_tcga]).relabel('CA9 Overexpression in TCGA and GTEx')
\todo[inline]{Benedict: 62/100 of the top most DE genes are shared between Tumor/Normal and Tumor/GTEx}
# NOTE: 62 / 100 of the top most differentially expressed genes are shared
top_gtex = ca9_gtex.data.sort_values('l2fc', ascending=False).index[:100].tolist()
top_tcga = ca9_tcga.data.sort_values('l2fc', ascending=False).index[:100].tolist()
Now we examine CA9 distribution in the context of all tissues as CA9 upregulation has also been reported in colorectal, uterine cervical, hepatocellular, and bladder cancers. \cite{urquidi_vascular_2012,lee_tumor_2007,cleven_stromal_2007,bandla_comparative_2012,onishi_hypoxia_2011}.
h.gene_distribution(gene='CA9')
Every normal tissue has low expression except for esophagus, stomach, and testis. Esophagus (tcga-normal) can be ignored due to its low sample size. Stomach normal tissue appears to upregulate CA9 which is seen in literature, "Non-cancerous [gastric] tissues strongly expressed CA9 with membranous localisation" \cite{chen_expression_2005}. Now we can compare the average fold change and expression of all tissues to identify tissues with similar expression profiles for CA9.
path = [(4, 4), (9.5, 4), (9.5, 8), (4, 8), (4, 4)]
h.gene_de('CA9', extents=(1.3, -4.3, 14, 12)) * \
hv.Path([path]) * \
h.highlight_points(13.821, 11.643, color='red', size=0.4)
Kidney is an extreme outlier, but several tissues possess both high levels of expression as well as signifcant L2FC for CA9 (blue boxed area).
\todo[inline]{Get pathway citation from Rob}
Pathway information from TBD shows three upstream proteins that transcriptionally promote CA9: SP1, SP3, and the master transcriptional regulator of cellular and developmental response to hypoxia \cite{wang_hypoxia-inducible_1995}. Tumor hypoxia also instigates upregulation of the VEGF pathway (angiogenesis) and GLUT1/SLC2A1 (metabolism) downstream of CA9 \cite{chung_glut1_2009,hoskin_glut1_2003}.
%%opts HeatMap [width=450]
candidates = ['Pancreas', 'Uterus', 'Colon', 'Lung', 'Esophagus', 'Kidney']
non_candidates = list(set(df.tissue.unique()) - set(candidates))
upstream_genes = ['SP1', 'SP3', 'HIF1A', 'ARNT']
downstream_genes = ['SLC2A1', 'VEGFA', 'KDR', 'FLT1']
# Plot up / downstream genes to CA9
up_de = h.gene_de_heatmap(genes=upstream_genes, tissue_subset=candidates).relabel('Upstream')
down_de = h.gene_de_heatmap(genes=downstream_genes, tissue_subset=candidates).relabel('Downstream')
(up_de + down_de)
Kidney tumor samples significantly upregulate the downstream VEGF/GLUT1 genes. GLUT1 is overexpressed in most tumor samples, but not to the degree that the candidate tissues are.
glut_dist = pd.DataFrame()
glut_dist['l2fc'] = [3.87, 1.48, 3.37, 4.0, 5.9, 3.69] + [2.3, 2.0, 0.2, 2.2, 0.9, 1.4, -1.7, 3.1, 3.7, 2.3]
glut_dist['class'] = ['Candidates' for _ in candidates] + ['Non-Candidates' for _ in non_candidates]
hv.BoxWhisker(glut_dist, kdims=['class'], vdims=['l2fc']).relabel('GLUT1 Upregulation')
%%opts Overlay [tabs=True] BoxWhisker [width=600]
hv.Overlay([h.gene_distribution('SLC2A1', tissue_subset=[x], types=True).relabel(x) for x in sorted(candidates)])
CA9 is a ubiquitous tumor biomarker for RCC with possibilities for treatment from new immunoconjugatives derived from Girentuximab \cite{wilex}. Exploration of CA9 differential expression across cancers in TCGA and GTEx reveal significant upregulation across several cancer subtypes in addition to upregulation for other hypoxic markers. This suggests that some cancers in other tissues, like hepatocellular adenocarcinoma, may benefit from CA9-targeted drugs.
Cancer cells sometimes accrue mutations that assist them in avoiding detection by the immune system. One of these mechanisms is programmed death-ligand 1 (PD-L1; CD274) which is expressed on many cancer and immune cells and plays an important part in blocking the ‘cancer immunity cycle’ by binding programmed death-1 (PD-1; PDCD1) and B7.1 (CD80), both of which are negative regulators of T-lymphocyte activation \cite{herbst_predictive_2014}. PD-L1 is widely expressed by antigen-presenting cells (e.g., macrophages, B cells, dendritic cells) as a way of fine-tuning peripheral immune activation and avoiding autoimmunity \cite{inman_atezolizumab_2017}. There are several FDA-approved antineoplastics that work through interference of PD1/PD-L1 such as Atezolizumab, Avelumab, Durvalumab, and Nivolumab.
Atezolizumab is a humanized monoclonal immunoglobulin G1 antibody to PD-L1 that is used in cancer immunotherapy. Inhibition of PD-L1 overcomes the usual block in immune surveillance of tumor cell neoantigens and can induce remissions in several forms of advanced, metastatic cancer including urothelial (bladder and urethral) carcinoma and NSCLC \cite{rosenberg_atezolizumab_2016,balar_atezolizumab_2017,fehrenbacher_atezolizumab_2016}.
Avelumab binds PD-L1 and blocks the interaction between PD-L1 and its receptors PD-1 and B7.1. This interaction releases the inhibitory effects of PD-L1 on the immune response resulting in the restoration of immune responses, including anti-tumor immune responses and growth. Avelumab has been tested in urothelial and skin carcinomas \cite{kaufman_avelumab_2016,disis_avelumab_2015,apolo_avelumab_2017-1,gulley_avelumab_2015}.
Durvalumab is a programmed death-ligand 1 (PD-L1) blocking antibody indicated for the treatment of patients with locally advanced or metastatic urothelial carcinoma or NSCLC \cite{massard_safety_2016,antonia_durvalumab_2017,planchard_phase_2016}.
Nivolumab is a fully human immunoglobulin G4 (IgG4) monoclonal antibody that selectively inhibits programmed cell death-1 (PD-1) activity by binding to the PD-1 receptor to block the ligands PD-L1 and PD-L2 from binding. The negative PD-1 receptor signaling that regulates T-cell activation and proliferation is therefore disrupted \cite{robert_anti-programmed-death-receptor-1_2014}. This releases PD-1 pathway-mediated inhibition of the immune response, including the antitumor immune response \cite{fda125554s012lbl.pdf}. Nivolumab is used as a first line treatment for inoperable metastatic melanoma in combination with ipilimumab if the cancer does not have a mutation in BRAF, as a second-line treatment following treatment with ipilimumab and if the cancer has a mutation in BRAF, with a BRAF inhibitor \cite{van_rooij_tumor_2013,wolchok_nivolumab_2013}. It is also used as a second-line treatment for squamous non-small cell lung cancer \cite{brahmer_nivolumab_2015}, non-squamous NSCLC \cite{borghaei_nivolumab_2015}, and as a second-line treatment for renal cell carcinoma \cite{motzer_nivolumab_2015}. Recently Nivolumab has also been used to treat bladder cancer \cite{powles_mpdl3280a_2014}. Side effects of Nivolumab include severe immune-related inflammation of the lungs, colon, liver, kidneys, and thyroid, and there are effects on skin, central nervous system, the heart, and the digestive system \cite{fda125554s012lbl.pdf}.
%%opts Overlay [width=400]
PDL1 = 'CD274'
pos_tissues = ['Skin', 'Kidney', 'Lung', 'Bladder']
kdes = hv.Layout([h.gene_kde(gene=PDL1, tissue_subset=[t]).relabel(t) for t in pos_tissues])
kdes.relabel('PD-L1 Expression').cols(2)
Skin and kidney samples both have upregulation in the tumor samples compared to their normal counterparts. Bladder does as well if the TCGA-normal samples are discarded, but there are very few samples in either the GTEx or TCGA-normal dataset. Neither squamous or non-squamous NSC lung cancer samples are upregulated compared to their GTEx or TCGA-normal counterpart, but that corrobrates a recent report stating that Nivolumab didn't perform any better than chemotherapy for treating lung cancer \cite{loftus_bristol_2016}.
h.l2fc_by_perc_samples(gene=PDL1, tissue_subset=pos_tissues).relabel('PD-L1 Samples by Fold Change')
About half of renal cell carcinoma and melanoma samples have greater than log2(2) fold change relative to their GTEx counterpart, whereas lung adenocarcinoma has fewer than 10%.
h.gene_distribution(PDL1).relabel('PD-L1 Expression Across Tissues')
Aside from known positives, hepatocellular cancer, stomach adenocarcinoma, thyroid carcinoma, and testicular germ cell tumors are all differentially expressed with respect to their GTEx counterpart. Hepatocellular and testicular cancers both have no TCGA normal samples. Stomach and thyroid are both listed as side-effect tissues of Nivolumab \cite{fda125554s012lbl.pdf}. Next we identify tissues with similar differential expression motifs.
de_plot = h.gene_de(PDL1)
d = de_plot.data
# Use kidney for L2FC cutoff
tissues = d[d.L2FC >= float(d[d.Tissue == 'Kidney'].L2FC)]
# Box positive control tissues and those nearby
top = h.highlight_points(xs=tissues.Expression, ys=tissues.L2FC)
pos = [h.highlight_points(xs=d[d.Tissue==x].Expression,
ys=d[d.Tissue==x].L2FC, color='red') for x in pos_tissues]
hv.Overlay([de_plot, top] + pos)
candidates = ['Testis', 'Stomach', 'Thyroid']
h.l2fc_by_perc_samples(gene=PDL1, tissue_subset=['Testis', 'Stomach', 'Thyroid'])
Stomach, testes, and thyroid samples share similar differential expression patterns to skin, kidney, and bladder tissues, with about ~50% of tumor samples possessing a greater than log2(2) fold change. This corroborates findings for a study on gastric cancers in 2016, "In this population of patients with recurrent or metastatic PD-L1-positive gastric cancer, [anti-PD1 mAb] had a manageable toxicity profile and promising antitumour activity, warranting further study in phase 2 and 3 trials" \cite{muro_pembrolizumab_2016}. High PD-L1 expression has also been associated with poor prognosis in hepatocellular carcinoma where "Monoclonal antibodies against PD-L1 or PD-1 induced a substantial antitumor effect on murine pancreatic cancer in vivo" \cite{nomi_clinical_2007} and has been used in conjunction with other drugs that treat hepatocellular carcinoma in cell lines \cite{feig_targeting_2013}. This dataset also validates studies that show frequent upregulation of PD-L1 in testicular germ cell tumors \cite{fankhauser_frequent_2015,cierna_prognostic_2016}.
\todo[inline]{Emailed Valdo re: pathway files}
\todo[inline]{Emailed Olena re: meeting}
\todo[inline]{Need to email committee}